Abstract

We consider the design of multi-agent systems so as to optimize 
an overall world utility function when (i) those systems lack
centralized communication and control [10, 14], and (ii) each agent
runs a distinct Reinforcement Learning (RL) algorithm [3, 11, 13].
A crucial issue in such design problems is to initialize/update each 
agent's private utility function, so as to induce best possible world 
utility. Traditional “team game” solutions to this problem sidestep
this issue and simply assign to each agent the world utility as its
private utility function [6]. In previous work we used the “Collective
Intelligence” framework to derive a better choice of private
utility functions, one that results in world utility performance up 
to orders of magnitude superior to that ensuing from use of the 
team game utility [15, 17, 16, 18]. In this paper we extend these 
results. We derive the general class of private utility functions that 
both are easy for the individual agents to learn and that, if learned 
well, result in high world utility. We demonstrate experimentally 
that using these new utility functions can result in significantly 
improved performance over that of our previously proposed utility, 
over and above that previous utility’s superiority to the conven- 
tional team game utility. 


1 Introduction 

In this paper we are interested in multi-agent systems (MAS’s) where: 

• the agents each run separate reinforcement learning (RL) algorithms; 

• there is little centralized, personalized communication or control; 

• there is a provided world utility function rating possible histories of the full system. 

In such a system, we are confronted with an inverse problem: How should we
initialize/update the agents’ individual utility functions to ensure that the agents do
not “work at cross-purposes”, so that their collective behavior maximizes the
provided world utility function? Intuitively, we need to provide the agents with utility
functions they can learn well, while also ensuring that their doing so won’t result in
phenomena like the Tragedy of the Commons (TOC; [8]) or Braess’ paradox [13].

This problem is related to work in many other fields, including computational
economics, mechanism design, reinforcement learning, computational ecologies,
(partially observable) Markov decision processes and game theory. However none of
these fields is both applicable in large, real-world problems, and directly addresses
the general inverse problem. (See [16] for a detailed discussion of the relationship
between these fields, involving hundreds of references.) Other previous work does
consider the general inverse problem, and does so by employing MAS's in which 
each agent uses reinforcement learning [5]. However this work simply elects to pro- 
vide each agent with the world utility function as its private utility function (i.e., 
implements a “team” game). Unfortunately, as expounded below and in previous 
work, this approach scales to large problems very poorly. (Intuitively, the difficulty 
is that each agent can have a hard time discerning the echo of its behavior on the 
world utility when the system is large.) 

In previous work, as an alternative to the team game approach, we used the
“COllective INtelligence” (COIN) framework to derive the alternative “Wonderful Life”
private utility function (WLU) [16], and demonstrated that it markedly outper- 
forms the team game private utility in several disparate domains [15, 17, 16, 18]. 
Specifically, in [18] we considered an economics “congestion” game, namely
a more challenging variant of Arthur’s “El Farol bar attendance problem” [1], also
known as the “minority game” [4]. In this problem, agents have to determine which
night in the week to attend a bar. The problem is set up so that if either too few 
people attend (boring evening) or too many people attend (crowded evening), the 
total enjoyment of the attendees drops. Note the built-in frustration effect that if 
all agents could predict attendance perfectly, they would all make the same atten- 
dance choice, and total enjoyment would be minimal. The goal is to avoid this by 
designing the reward functions of the attendees so that the total enjoyment across 
all nights is maximized. Our results indicate that use of the WLU can result in 
performance orders of magnitude superior to that of team game utilities [18]. 

The WLU has a free parameter (the “clamping parameter”), which we simply set 
to 0 in our previous work. In this paper we employ a series of approximations to 
derive a theoretically optimal value of the clamping parameter, and demonstrate 
the empirical superiority of that value in computer experiments. To derive the op- 
timal value we must employ some of the mathematics of COINs, whose relevant 
concepts we review in the next section. We next use those concepts to sketch the 
calculation deriving the optimal clamping parameter. Our experiments involved 
the Bar problem, whose detailed setup is discussed in Section 3. Finally we present
the results of the experiments in Section 4. Those results corroborate the predicted 
improvement in performance when using our theoretically derived clamping param- 
eter. This extends even further the superiority of the COIN-based approach above 
that of conventional team-game approaches. 


2 Theory of COINs 

In this section we summarize that part of the mathematics of COINs that is
relevant to the study in this article. We consider the state of the system across a
set of consecutive time steps, $t \in \{0, 1, \ldots\}$. Without loss of generality, all relevant
characteristics of agent $\eta$ at time $t$ (including its internal parameters at that time
as well as its externally visible actions) are encapsulated by a Euclidean vector
$\zeta_{\eta,t}$, the state of agent $\eta$ at time $t$. $\zeta_{,t}$ is the set of the states of all agents at $t$, and
$\zeta$ is the system’s worldline, i.e., the state of all agents across all time.


World utility is $G(\zeta)$, and when $\eta$ is an RL algorithm “striving to increase” its
private utility, we write that utility as $\gamma_\eta(\zeta)$. (The mathematics can readily be
generalized beyond such RL-based agents; see [16] for details.) Here we restrict
attention to utilities of the form $\sum_t R_t(\zeta_{,t})$ for reward functions $R_t$.

We are interested in systems whose dynamics is deterministic. (This covers in
particular any system run on a digital computer, even one using a pseudo-random
number generator to generate apparent stochasticity.) We indicate that dynamics
by writing $\zeta = C(\zeta_{,0})$. So all characteristics of an agent $\eta$ at $t = 0$ that affect the
ensuing dynamics of the system, including its private utility, are included in $\zeta_{\eta,0}$.

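Definition: A system is factored if for each agent $\eta$ individually,

$$\gamma_\eta(C(\zeta_{,0})) \geq \gamma_\eta(C(\zeta'_{,0})) \;\Leftrightarrow\; G(C(\zeta_{,0})) \geq G(C(\zeta'_{,0})),$$

for all pairs $\zeta_{,0}$ and $\zeta'_{,0}$ that differ only for node $\eta$.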

For a factored system, when every agent’s private utility is optimized (given the
other agents’ behavior), world utility is at a critical point (e.g., a local maximum)
[16]. In game-theoretic terms, optimal global behavior occurs when the agents are
at a private utility Nash equilibrium [7]. (Accordingly, there can be no TOC for a
factored system [16, 17, 18].) In addition, off of equilibrium, the private utilities in
factored systems are “aligned” with the world utility.

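Definition: The ($t = 0$) effect set of node $\eta$ at $\zeta$, $S^{\mathrm{eff}}_\eta(\zeta)$, is the set of all
components $\zeta_{\eta',t}$ for which the gradients $\nabla_{\zeta_{\eta,0}} (C(\zeta_{,0}))_{\eta',t} \neq \vec{0}$. $S^{\mathrm{eff}}_\eta$ with no
specification of $\zeta$ is defined as $\cup_{\zeta \in C}\, S^{\mathrm{eff}}_\eta(\zeta)$. We will also find it useful to define
$\sim\! S^{\mathrm{eff}}_\eta$ as the set of all components that are not in $S^{\mathrm{eff}}_\eta$.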

Intuitively, the $t = 0$ effect set of $\eta$ is the set of all node-time pairs which, under the
deterministic dynamics of the system, are affected by changes to $\eta$’s $t = 0$ state.

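Theorem: A system is factored at all $\zeta \in C$ iff for all those $\zeta$ and for all $\eta$, we can write

$$\gamma_\eta(\zeta) = \Phi_\eta(\zeta_{\sim S^{\mathrm{eff}}_\eta}, G(\zeta)) \qquad (1)$$

for some function $\Phi_\eta(\cdot, \cdot)$ such that $\partial_G \Phi_\eta(\zeta_{\sim S^{\mathrm{eff}}_\eta}, G) > 0$ for all $\zeta \in C$ and associated
$G$ values (the form of the $\{\gamma_\eta\}$ off of $C$ is arbitrary). (Proof in [16].)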

Definition: Let $\sigma$ be a set of agent-time pairs. $CL_\sigma(\zeta)$ is $\zeta$ modified by “clamping”
the states corresponding to the elements of $\sigma$ to some arbitrary pre-fixed vector $\vec{K}$.
Then the (effect set) Wonderful Life Utility for node $\eta$ (at time 0) is

$$WLU_\eta(\zeta) \equiv G(\zeta) - G(CL_{S^{\mathrm{eff}}_\eta}(\zeta)),$$

where conventionally $\vec{K} = \vec{0}$.

Note the crucial fact that to evaluate the WLU one does not need to know how to
calculate the system’s behavior under counter-factual starting conditions. All that
is needed to evaluate $WLU_\eta$ is the function $G(\cdot)$, the actual $\zeta$, and $S^{\mathrm{eff}}_\eta$ (which can
often be well-approximated even with little knowledge of $C$).
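
The definition above can be illustrated with a minimal sketch. It assumes a hypothetical one-week toy system (three nights, one-hot actions, an agent's effect set taken to be just its own action) and a toy world utility built from a function of the form y exp(-y/c); the names `G`, `clamp`, `wlu` and `is_factored` are illustrative and not taken from the paper. The sketch evaluates the WLU by clamping and then checks the factoredness definition by brute force on this toy system.

```python
from itertools import product
from math import exp

NIGHTS = 3  # hypothetical toy size, not the seven nights used in the experiments


def phi(y, c=1.5):
    """Toy per-night enjoyment; c = 1.5 is an arbitrary illustrative value."""
    return y * exp(-y / c)


def G(zeta):
    """Toy world utility: zeta is a sequence of per-agent action vectors."""
    attendance = [sum(state[k] for state in zeta) for k in range(NIGHTS)]
    # Sorting makes the floating-point sum independent of night ordering.
    return sum(sorted(phi(x) for x in attendance))


def clamp(zeta, agent, k_vec):
    """Return a copy of zeta with the agent's state replaced by the clamp vector."""
    clamped = list(zeta)
    clamped[agent] = tuple(k_vec)
    return clamped


def wlu(zeta, agent, k_vec):
    """Wonderful Life Utility: G of the actual state minus G of the clamped state."""
    return G(zeta) - G(clamp(zeta, agent, k_vec))


def one_hot(night):
    return tuple(1.0 if k == night else 0.0 for k in range(NIGHTS))


def is_factored(utility, n_agents):
    """Brute-force check: changing one agent's action moves the private
    utility and the world utility in the same direction."""
    choices = [one_hot(k) for k in range(NIGHTS)]
    for joint in product(choices, repeat=n_agents):
        for agent in range(n_agents):
            for alt in choices:
                other = clamp(joint, agent, alt)
                private_up = utility(list(joint), agent) >= utility(other, agent)
                world_up = G(joint) >= G(other)
                if private_up != world_up:
                    return False
    return True
```

Note that `wlu` only calls `G` on the actual state and on a clamped copy of it; no counter-factual dynamics are simulated.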

Since $G(CL_{S^{\mathrm{eff}}_\eta}(\zeta))$ is a function only of $\zeta_{\sim S^{\mathrm{eff}}_\eta}$, by Thm. 1 we know that WLU
is factored. As another example, if $\gamma_\eta = G \;\forall \eta$ (a team game), then the system is
factored, in this case regardless of $C$. However for large systems where $G$ sensitively
depends on all components of the system, each agent may experience difficulty
discerning the effects of its actions on $G$. As a consequence, each $\eta$ may have
difficulty achieving high $\gamma_\eta$ in a team game. We can quantify this signal/noise effect
by comparing the ramifications on $\gamma_\eta(\zeta = C(\zeta_{,0}))$ arising from changes to $\zeta_{\eta,0}$ with
the ramifications arising from changes to $\zeta_{\hat{\eta},0}$, where $\hat{\eta}$ represents all nodes other
than $\eta$. We call this quantification learnability [16]. A linear approximation to
the learnability in the vicinity of $\zeta$ is the differential learnability $\lambda_{\eta,\gamma_\eta}(\zeta)$:

$$\lambda_{\eta,\gamma_\eta}(\zeta) \equiv \frac{\|\nabla_{\zeta_{\eta,0}} \gamma_\eta(C(\zeta_{,0}))\|}{\|\nabla_{\zeta_{\hat{\eta},0}} \gamma_\eta(C(\zeta_{,0}))\|}. \qquad (2)$$



It can be proven that in many circumstances, especially in large problems, WLU
has much higher differential learnability than does the team game choice of private
utilities [16]. (Intuitively, this is due to the subtraction occurring in the WLU’s
removing a lot of the noise.) The result is that convergence to optimal $G$ with
WLU is much quicker (up to orders of magnitude so) than with a team game.
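
As a rough numerical illustration of this signal/noise argument, the sketch below estimates the differential learnability as a ratio of gradient norms (own state in the numerator, all other agents' states in the denominator) by central finite differences. Everything specific here is an assumption made for illustration: a single time step with trivial dynamics, one scalar state per agent, a toy world utility that depends only on the sum of the states, and the helper names.

```python
from math import exp, sqrt


def phi(y, c=6.0):
    """Toy utility shape; c = 6 is an arbitrary illustrative value."""
    return y * exp(-y / c)


def G(z):
    """Hypothetical world utility of one scalar state per agent."""
    return phi(sum(z))


def team_utility(z, agent):
    """Team game: every agent's private utility is G itself."""
    return G(z)


def wlu_zero(z, agent):
    """WLU with the agent's own state clamped to 0."""
    clamped = list(z)
    clamped[agent] = 0.0
    return G(z) - G(clamped)


def grad_norm(utility, z, agent, indices, h=1e-5):
    """Euclidean norm of the central-difference gradient over the given indices."""
    total = 0.0
    for j in indices:
        up, down = list(z), list(z)
        up[j] += h
        down[j] -= h
        d = (utility(up, agent) - utility(down, agent)) / (2 * h)
        total += d * d
    return sqrt(total)


def differential_learnability(utility, z, agent):
    """Sensitivity to the agent's own state over sensitivity to all other states."""
    others = [j for j in range(len(z)) if j != agent]
    own = grad_norm(utility, z, agent, [agent])
    rest = grad_norm(utility, z, agent, others)
    return own / rest
```

In this toy setting the team-game ratio is 1/sqrt(n - 1) for n agents, because every state enters G symmetrically, whereas the subtraction in the WLU cancels most of the dependence on the other agents and so shrinks the denominator.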

However the equivalence class of utilities that are factored for a particular $G$ is not
restricted to the associated team game utility and clamp-to-$\vec{0}$ WLU. Indeed, one
can consider solving for the utility in that equivalence class that maximizes
differential learnability. An approximation to this calculation is to solve for the factored
utility that minimizes the expected value of $[\lambda_{\eta,WLU_\eta}]^{-2}$, where the expectation is
over the values $\zeta_{,0}$ that, while fixed, are not known to the system designer. (As
an example, algorithms using pseudo-random number generators are deterministic, 
strictly speaking, but are effectively stochastic to the system designer.) 

A number of further approximations (too long to go through here) have to
be made to complete this calculation. The final answer can be approximated as a
WLU in which $\vec{K}$ is not $\vec{0}$, but rather equals the expected $\zeta_{S^{\mathrm{eff}}_\eta}$. Now in the experiments
recounted below $S^{\mathrm{eff}}_\eta$ is approximated as the sequence of $\eta$’s successive actions
(i.e., the approximation is made that to first order, $\eta$’s actions have no effects
on the actions of other agents). Furthermore, for simplicity, we do not actually
clamp each $\eta$ separately to its own average action sequence, which would involve
modifying $WLU_\eta$ in an online manner. Rather we clamp all agents to the same
average action. We then made the guess that the typical probability distribution
over actions is uniform. (Intuitively, we would expect such a choice to be more
accurate at early times than at later times in which agents have “specialized”.)


3 The Bar Problem 

We focus on the following six more general variants of the bar problem investigated 
in [18]: There are N agents, each picking one out of seven actions every week. Each 
action corresponds to attending the bar on some particular set of $l \in \{1, 2, 3, 4, 5, 6\}$
out of the seven nights of the current week, i.e., given $l$, each action is a vertex of the
7-dimensional unit hypercube having $l$ 1’s. At the end of the week the agents get
their rewards and the process is repeated. For simplicity we chose the attendance 
profiles of each potential action so that when the actions are selected uniformly the 
resultant attendance profile across all seven nights is also uniform. 
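
The text does not say which seven attendance profiles were used. One construction that satisfies the stated uniformity condition, shown here purely as an assumption, is to let each action attend l cyclically consecutive nights; the function names are hypothetical. Under uniform action selection every night is then covered by exactly l of the seven actions, so the average action has the value l/7 on every night.

```python
def cyclic_actions(l, nights=7):
    """Seven 0/1 action vectors, each attending l cyclically consecutive nights.

    This is one possible choice of attendance profiles, not necessarily
    the one used in the experiments.
    """
    return [
        tuple(1 if (k - start) % nights < l else 0 for k in range(nights))
        for start in range(nights)
    ]


def average_action(actions):
    """Mean action vector when the actions are selected uniformly at random."""
    n = len(actions)
    return tuple(sum(a[k] for a in actions) / n for k in range(len(actions[0])))
```

For example, `cyclic_actions(2)[0]` attends the first two nights of the week, and `average_action(cyclic_actions(2))` has every component equal to 2/7.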

World utility is $G(\zeta) = \sum_t R_G(\zeta_{,t})$, where $R_G(\zeta_{,t}) \equiv \sum_{k=1}^{7} \phi_k(x_k(\zeta, t))$, $x_k(\zeta, t)$ is
the total attendance on night $k$ at week $t$, $\phi_k(y) \equiv y \exp(-y/c)$, and $c$ is a real-valued
parameter. (To keep the “congestion” level constant, for $l$ going from 1 to 6,
$c = \{3, 6, 8, 10, 12, 15\}$ respectively.) Our choice of $\phi_k(\cdot)$ means that when either too
few or too many agents attend some night in some week, world reward $R_G$ is low.
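
A minimal sketch of this world utility follows, assuming the per-night function has the form y exp(-y/c) and using the values of c listed above; the function names and the attendance figures in the usage note are illustrative. Since the derivative of y exp(-y/c) is (1 - y/c) exp(-y/c), the per-night contribution peaks at an attendance of y = c and falls off on either side.

```python
from math import exp

# Values of c quoted in the text for l = 1, ..., 6 nights attended.
C_BY_L = {1: 3, 2: 6, 3: 8, 4: 10, 5: 12, 6: 15}


def phi(y, c):
    """Per-night contribution: low for both empty and overcrowded nights."""
    return y * exp(-y / c)


def world_reward(attendance, c):
    """R_G for one week: sum over nights of phi(total attendance that night)."""
    return sum(phi(x, c) for x in attendance)


def world_utility(weekly_attendance, c):
    """G: sum of the weekly world rewards over all weeks."""
    return sum(world_reward(week, c) for week in weekly_attendance)
```

With 60 agents and l = 1, a hypothetical week in which everyone attends the same night, `world_reward([60, 0, 0, 0, 0, 0, 0], C_BY_L[1])`, scores far below a roughly even spread such as `[9, 9, 9, 9, 8, 8, 8]`.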

Since we are concentrating on the utilities rather than on the RL algorithms that
use them, we use (very) simple RL algorithms. Each agent $\eta$ has a 7-dimensional
vector giving its estimates of the reward it would receive for taking each possible
action. At the beginning of each week, each $\eta$ picks the night to attend randomly,
using a Boltzmann distribution over the seven components of $\eta$’s estimated rewards
vector. For simplicity, temperature does not decay in time. However to reflect the
fact that each agent operates in a non-stationary environment, reward estimates
are formed using exponentially aged data: in any week $t$, the estimate $\eta$ makes
for the reward for attending night $i$ is a weighted average of all the rewards it
has previously received when it attended that night, with the weights given by an
exponential function of how long ago each such reward was. To form the agents’
initial training set, we had an initial period in which all actions by all agents were
chosen uniformly randomly, with no learning.
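
The learner described above can be sketched as follows. The class name, the `decay` constant, the default temperature, and the estimate of 0 for an action that has never been tried are assumptions made for illustration; the text does not give the aging rate, and it handles untried actions through the initial random training period instead.

```python
import math
import random


class BoltzmannAgent:
    """Simple learner: Boltzmann action choice over exponentially aged
    reward estimates. Parameter values are illustrative assumptions."""

    def __init__(self, n_actions=7, temperature=0.05, decay=0.9, rng=None):
        self.temperature = temperature  # fixed; it does not decay in time
        self.decay = decay              # per-week aging factor (assumed value)
        self.weighted_sum = [0.0] * n_actions
        self.weight = [0.0] * n_actions
        self.rng = rng or random.Random()

    def estimates(self):
        """Weighted average of past rewards per action (0.0 if never tried)."""
        return [s / w if w > 0 else 0.0
                for s, w in zip(self.weighted_sum, self.weight)]

    def probabilities(self):
        """Boltzmann distribution over the estimated rewards."""
        est = self.estimates()
        top = max(est)  # subtract the maximum for numerical stability
        expo = [math.exp((e - top) / self.temperature) for e in est]
        total = sum(expo)
        return [x / total for x in expo]

    def choose(self):
        """Sample this week's action."""
        actions = range(len(self.weight))
        return self.rng.choices(actions, weights=self.probabilities())[0]

    def observe(self, action, reward):
        """Age all stored data by one week, then add the new reward with weight 1."""
        self.weighted_sum = [self.decay * s for s in self.weighted_sum]
        self.weight = [self.decay * w for w in self.weight]
        self.weighted_sum[action] += reward
        self.weight[action] += 1.0
```

Keeping a running weighted sum and a running weight per action gives the same estimate as storing every past reward with a weight of `decay` raised to its age in weeks, without storing the history.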


4 Experimental Results 


We investigate three choices of $\vec{K}$: $\vec{0}$, $\vec{1} = (1, 1, 1, 1, 1, 1, 1)$, and the “average” action,
$\vec{a} = \vec{1}/7$. The associated WLU’s are distinguished with a superscript. In the
experiments reported here all agents have the same reward function, so from now
on we drop the agent subscript from the private utilities. Writing them out, the
three WLU’s provide the following reward functions:


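$$\begin{aligned}
R_{WL^{\vec{0}}}(\zeta_{,t}) &\equiv R_G(\zeta_{,t}) - R_G(CL^{\vec{0}}_{\eta}(\zeta_{,t})) \\
&= \phi_{d_\eta}(x_{d_\eta}(\zeta,t)) - \phi_{d_\eta}(x_{d_\eta}(\zeta,t) - 1) \\[4pt]
R_{WL^{\vec{1}}}(\zeta_{,t}) &\equiv R_G(\zeta_{,t}) - R_G(CL^{\vec{1}}_{\eta}(\zeta_{,t})) \\
&= \sum_{d \neq d_\eta} \left[ \phi_d(x_d(\zeta,t)) - \phi_d(x_d(\zeta,t) + 1) \right] \\[4pt]
R_{WL^{\vec{a}}}(\zeta_{,t}) &\equiv R_G(\zeta_{,t}) - R_G(CL^{\vec{a}}_{\eta}(\zeta_{,t})) \\
&= \sum_{d \neq d_\eta} \left[ \phi_d(x_d(\zeta,t)) - \phi_d(x_d(\zeta,t) + a_d) \right] \\
&\quad + \phi_{d_\eta}(x_{d_\eta}(\zeta,t)) - \phi_{d_\eta}(x_{d_\eta}(\zeta,t) - 1 + a_{d_\eta})
\end{aligned}$$
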


where $d_\eta$ is the night picked by $\eta$ and $a_d = 1/7$. The team game reward function is
simply $R_G$. Note that to evaluate $R_{WL^{\vec{0}}}$ each agent only needs to know the total
attendance on the night it attended. In contrast, $R_G$ and $R_{WL^{\vec{a}}}$ require centralized
communication concerning all 7 nights, and $R_{WL^{\vec{1}}}$ requires communication concerning
6 nights. Finally, note that when viewed in attendance space rather than
action space, $CL^{\vec{a}}$ is clamping to the attendance vector $v_i = \sum_{d=1}^{7} u_{d,i}/7$, where $u_{d,i}$
is the $i$’th component (0 or 1) of the $d$’th action vector. So for example, for
$l = 1$, $CL^{\vec{a}}$ clamps to $v_i = \sum_{d=1}^{7} \delta_{d,i}/7$, where $\delta_{d,i}$ is the Kronecker delta function.
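
The sketch below writes out the three reward functions for the l = 1 case and compares each against a direct evaluation of the clamping definition, R_G of the actual attendance minus R_G of the attendance with the agent's action replaced by the clamp vector. It assumes the per-night function y exp(-y/c) with c = 3, reads the closed forms as a difference on the attended night for the clamp to 0, a sum over the six other nights for the clamp to 1, and both kinds of term with a_d = 1/7 for the clamp to the average action; all function names are hypothetical.

```python
from math import exp

NIGHTS = 7


def phi(y, c=3.0):
    """Per-night contribution, assuming the l = 1 value c = 3."""
    return y * exp(-y / c)


def R_G(x):
    """World reward for one week's attendance vector x."""
    return sum(phi(xk) for xk in x)


def clamp_reward(x, night, k_vec):
    """Definition: R_G(x) minus R_G with the agent's one-hot action on
    `night` replaced by the clamp vector k_vec."""
    clamped = [xk - (1 if k == night else 0) + k_vec[k]
               for k, xk in enumerate(x)]
    return R_G(x) - R_G(clamped)


def r_wl_zero(x, night):
    """Clamp to 0: only the attended night's attendance is needed."""
    return phi(x[night]) - phi(x[night] - 1)


def r_wl_one(x, night):
    """Clamp to 1: the six nights the agent did not attend are needed."""
    return sum(phi(x[d]) - phi(x[d] + 1) for d in range(NIGHTS) if d != night)


def r_wl_avg(x, night, a=1.0 / NIGHTS):
    """Clamp to the average action: all seven nights are needed."""
    others = sum(phi(x[d]) - phi(x[d] + a) for d in range(NIGHTS) if d != night)
    return others + phi(x[night]) - phi(x[night] - 1 + a)
```

The argument lists mirror the communication requirements noted above: `r_wl_zero` reads a single entry of the attendance vector, `r_wl_one` reads six, and `r_wl_avg` reads all seven.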

The results we report in this section are averages over 20 runs with 60 agents, and 
throughout this article the error bars are too small to depict. Figure 1(a) shows the 
normalized world reward obtained for the different private utilities as a function of 
$l$. $R_{WL^{\vec{a}}}$ performs well for all problems. $R_{WL^{\vec{1}}}$ on the other hand performs poorly
when agents only attend on a few nights, but reaches the performance of $R_{WL^{\vec{a}}}$ when
agents need to select six nights, a situation where the two clamping vectors are very
similar ($\vec{1}$ and $\frac{6}{7}\vec{1}$ respectively). $R_{WL^{\vec{0}}}$ shows a slight drop in performance when the
number of nights to attend increases, while $R_G$ shows a much more pronounced drop.
Furthermore, in agreement with our previous results [18], despite being factored,
the poor signal-to-noise in $R_G$ results in poor performance with it for all problems.
(Temperatures varied between .01 and .02 for the three WL rewards, and between .1
and 2 for the G reward, which provided the respective best performances for each.)
These results confirm our theoretical prediction of which private utility converges
fastest to the world utility maximum.

Figure 1: (a) Behavior of different reward functions with respect to number of nights
to attend. (b) Sensitivity of reward functions to internal parameters. (In both
figures, $WL^{\vec{a}}$ is ◇; $WL^{\vec{0}}$ is +; $WL^{\vec{1}}$ is □; $G$ is ×.)

We also studied the sensitivity of performance to the internal parameters of the
learning algorithms. Figure 1(b) presents experiments with $l = 1$ for a set of different
temperatures in the RL algorithms. (The two straight lines correspond to the
optimal performance, and the “baseline” performance given by uniform occupancies
across all nights.) $R_{WL^{\vec{a}}}$ is fairly insensitive to the temperature, until it gets so high
that agents’ actions are chosen almost randomly. $R_{WL^{\vec{0}}}$ depends more than $R_{WL^{\vec{a}}}$
does on having sufficient exploration and therefore has a narrower range of good
temperatures. Both $R_{WL^{\vec{1}}}$ and $R_G$ have more serious learnability problems, and
therefore have shallower and thinner performance graphs.


5 Conclusion 

In this article we considered how to configure large multi-agent systems where 
each agent uses reinforcement learning, and where there is no personalized (agent- 
specific) centralized communication and control. The inverse problem associated 
with such systems is how to initialize/update the individual agents’ private utility 
functions so that their collective behavior optimizes a pre-specified world utility 
function. The mathematics of COINs is specifically designed for this problem, and 
in previous experiments systems based on it have far outperformed conventional
“team game” systems, in which each agent has the world utility as its private utility
function. Moreover, the gain in performance grows with the size of the system,
typically reaching orders of magnitude for systems that consist of hundreds of agents.

In those previous experiments the COIN-based private utilities had a free parameter, 
which we set to 0. However as we synopsised in this paper, a series of
approximations in the mathematics of COINs allows one to derive an optimal value for that
parameter. We then repeated some of our previous computer experiments, only
using this new value for the parameter. These experiments confirm that, with this
new value, the system converges to significantly superior world utility values, with
less sensitivity to the parameters of the agents’ RL algorithms. This makes even
stronger the arguments for using a COIN-based system rather than a team-game
system. Future work involves improving the approximations needed to calculate the
optimal private utility parameter value. In particular, given that that value varies
in time, we intend to investigate having it be calculated in an on-line manner.